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Abstract. When outcomes are missing for reasons beyond an inves- 
tigator's control, there are two different ways to adjust a parameter 
estimate for covariates that may be related both to the outcome and to 
missingness. One approach is to model the relationships between the 
covariates and the outcome and use those relationships to predict the 
missing values. Another is to model the probabilities of missingness 
given the covariates and incorporate them into a weighted or strat- 
ified estimate. Doubly robust (DR) procedures apply both types of 
model simultaneously and produce a consistent estimate of the param- 
eter if either of the two models has been correctly specified. In this 
article, we show that DR estimates can be constructed in many ways. 
We compare the performance of various DR and non-DR estimates 
of a population mean in a simulated example where both models are 
incorrect but neither is grossly misspecified. Methods that use inverse- 
probabilities as weights, whether they are DR or not, are sensitive to 
misspecification of the propensity model when some estimated propen- 
sities are small. Many DR methods perform better than simple inverse- 
probability weighting. None of the DR methods we tried, however, im- 
proved upon the performance of simple regression-based prediction of 
the missing values. This study does not represent every missing-data 
problem that will arise in practice. But it does demonstrate that, in at 
least some settings, two wrong models are not better than one. 
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1. INTRODUCTION 

1.1 Purpose 

A new class of methods called doubly robust (DR) 
procedures was designed to mitigate selection bias 
arising from uncontrolled nonresponse and attrition, 
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nonrandom treatment assignment in observational 
studies and noncompliance in randomized experi- 
ments (Robins and Rotnitzky, 2001; van der Laan 
and Robins, 2003). DR methods require specifica- 
tion of two models: one that describes the popu- 
lation of responses, and another that describes the 
process by which the data are filtered or selected 
to produce the observed sample. The distinguish- 
ing feature of DR estimates is that they remain 
asymptotically unbiased if one of the two models 
is misspecified — that is, they consistently estimate 
their targets if either model is true (Robins and Rot- 
nitzky, 1995). 

DR methods are a refinement of a weighted esti- 
mating-equations approach to regression with in- 
complete data proposed by Robins, Rotnitzky and 
Zhao (1994, 1995) and Rotnitzky, Robins and Scharf- 
stein (1998). Further explanation and evaluation of 
DR estimators has been given by Lunceford and 
Davidian (2004), Carpenter, Kenward and Vanstee- 
landt (2006), Davidian, Tsiatis and Leon (2005) and 
Bang and Robins (2005). What is not widely known, 
however, is that the methods developed by Robins 
et al. are not the only way to achieve double robust- 
ness. Many other types of estimates possess the DR 
property. Generalized regression estimators devel- 
oped for sample surveys (Cassel, Sarndal and Wret- 
man, 1977), also known as model-assisted survey es- 
timators (Sarndal, Swensson and Wretman, 1989, 
1992), have this property, as does a new class of 
parametric methods developed by Little and An 
(2004) and several other methods that do not seem 
to have been described before. 

The first purpose of this article is pedagogical. 
We review a variety of incomplete-data estimation 
strategies, describe various ways in which the DR 
property can arise, and connect the recent articles 
on DR estimation to similar techniques found in the 
literature on sample surveys and causal inference. 
Our second purpose is to investigate the practical 
behavior of these estimators not only when either of 
the underlying models is correct, but in a scenario 
where both models are moderately misspecified. 

1.2 Description of the Problem 

For simplicity, we focus on the problem of estimat- 
ing a population mean from an incomplete dataset. 
Many of the methods we present have been applied 
to the more general problem of estimating population- 
average regression coefficients, and we will describe 
those extensions where appropriate. Let us suppose 
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Fig. 1. Schematic representation of sample data for estimat- 
ing (a) a population mean and (b) an average causal effect, 
with missing values denoted by shading. 

that we have a random sample of units i = 1, . . . , n 
from an infinite population. The variable of primary 
interest is t/j. Let tj be the response indicator for 
Hi, so that ti = 1 if Ui is observed and tj = if yi 
is missing. For each unit, there is an observed p- 
dimensional vector of covariates Xi that may be re- 
lated both to yi and to tj. A schematic represen- 
tation of the sample data is shown in Figure 1(a). 
(In this figure, the indices i = 1, . . . , n have been per- 
muted so that the sample units having ti = 1 appear 
first.) Denote the numbers of respondents and non- 
respondents by n^^^ = J^i^i and n^^^ = ~ U), 
the population response and nonresponse rates by 
r(i) = P{ti = 1) and r(°) = P{ti = 0), and the sam- 
ple rates by f^^^ = n^-^^ jn and r^''^ = n'^^^ jn. 
The sample mean of the observed y^'s, 

i 

consistently estimates the mean for the respondents, 
/i^^^ = E{yi I = 1), under any reasonable popula- 
tion distribution and response mechanism. An esti- 
mator having this property will be said to be strongly 
robust. In general, there is no strongly robust esti- 
mate of the mean for the nonrespondents, [i^^^ = 
E{yi I ti = 0), or for the mean of the entire popu- 
lation, ^ = E{yi) = r^'^') n^^) + ^^^\ based on the 
observed data alone. Naive estimates such as y^^^ 
may work well enough if f^^^ is small and the re- 
lationships among Xi, ti and yi are weak. As the 
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nonresponse rate and the strength of these relation- 
ships grow, adjusting for selection bias becomes im- 
portant, and inferences become sensitive to the as- 
sumptions underlying the adjustment. 

This problem is closely related to estimating an 
average causal effect from an experiment or obser- 
vational study. Suppose now that ti is an indicator 
of the treatment received by unit i. Associated with 
unit i is a pair of potential outcomes: the response 
Uii that is realized if = 1, and another response yio 
that is realized if tj = 0. This situation is depicted 
in Figure 1(b). The causal effect of the treatment 
on unit i, defined as yn — yio, is unobservable be- 
cause one of the two potential outcomes is neces- 
sarily missing. It is often of interest to estimate the 
average causal effect (ACE) in the population, 

ACE = E{ya)-E{yio), 

or the ACE's among the treated and the untreated, 

^C^(i) = E{ya \U = l)- E{yio \U = 1), 

ACE^^^ = E{ya \U = 0)- E{yio \ti = 0). 

The notion of potential outcomes was introduced 
by Neyman (1923) for randomized experiments and 
by Rubin (1974a) for nonrandomized studies. Re- 
views of causal inference from this perspective are 
given by Holland (1986), Winship and Sobel (2004), 
Gelman and Meng (2004) and Rubin (2005). Prom 
Figure 1(b), it is apparent that the correlation be- 
tween yn and yio [more precisely, the partial cor- 
relation between them given Xi] see Rubin (1974b)] 
cannot be estimated from the observed data. With- 
out prior information on what this correlation might 
be — and it is unclear from where such information 
would come — one may separate the problem of esti- 
mating an ACE into independent estimation of the 
means of yn and yiQ. Any of the methods described 
in this article can be used to estimate an average 
causal effect by applying the method separately to 
each potential outcome. When estimating ACE's, 
rates of missing information tend to be high and 
sensitivity to modeling assumptions may be acute. 

1.3 Assumptions 

Throughout this article, we will assume that the 
response mechanism is unconfounded in the sense 
that yi and U are conditionally independent given Xi 
(Rubin, 1978); this assumption has also been called 
strong ignorability (Rosenbaum and Rubin, 1983). 
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Under strong ignorability, the joint distribution of 
the complete data can be written as 

P{X,T,Y)=Y[P{xi)P{U I x,)P{yi I Xi). 

i 

Strong ignorability implies that the missing yj's are 
missing at random (MAR) (Rubin, 1976; Little and 
Rubin, 2002). In many applications, MAR is unreal- 
istic. Nevertheless, this assumption provides an im- 
portant benchmark and point of departure for sensi- 
tivity analyses, and it is the foundation upon which 
DR procedures rest. 

None of the methods we examine will require an 
explicit model for the covariates, but they will all 
make assumptions about P{ti \ Xi), P{yi \ Xi) or both. 
Denote the response probability for unit i by 

(1) P(ti = 1 I Xj) =7rj(xj) =7rj. 

This probability is called the propensity score (Rosen- 
baum and Rubin, 1983). A proposed functional form 
for (1) will be called a vr-model. If estimates of the 
propensity scores are needed, they are often taken 
to be 

TTi = expit(xf a) = ^^v{x^ a) 
1 -|- exp(a;^ a) 

where a is the maximum-likelihood (ML) estimate 
of the coefficients from the logistic regression of ti, . . . , 
tn on xi, . . . ,rE„. In many situations, of course, the 
assumed form of the vr-model is not correct, and 
this misspecification can be problematic depending 
on how the TTj's are used. 
Let us also define E{yi \ Xi) = m{xi) = mj, so that 

(2) yi = m{xi) + ei 

with E[ei) =0. A functional form for m[xi) will be 
called a y-model. When an estimate of mj is needed, 
an obvious candidate is rhi = xf(3, where (5 is the 
vector of coefficients from the linear regression of 
yi on Xi estimated from the respondents; y-models 
with nonlinear link functions are also straightfor- 
ward (McCullagh and Nelder, 1989). In most cases, 
an analyst's regression model will be only a rough 
approximation to the true y-model. The implica- 
tions of this misspecification can be serious if P{xi \ 
ti = 1) and P{xi \ti = 0) are very different, because 
the rfij's for the nonrespondents will then be based 
on extrapolation. This is particularly true when Xi is 
high-dimensional because of the so-called curse of di- 
mensionality. With many covariates, it becomes dif- 
ficult to specify a y-model that is sufficiently flexible 
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to capture important nonlinear effects and interac- 
tions, yet parsimonious enough to keep the variance 
of prediction manageably low. 

Under ignorability, the vTj's play no role in like- 
lihood-based or Bayesian analyses for the parame- 
ters of the y-model (Rubin, 1976). The parametric 
approach — specifying a full model for m and ignor- 
ing the VTj's — is emphasized in texts on missing data 
by Rubin (1987), Schafer (1997), Little and Rubin 
(2002) and others. Nevertheless, many advocates of 
the parametric approach also recognize that the vTj's 
are useful for model validation and criticism (Gel- 
man, Carlin, Rubin and Stern, 2004, Chapters 6-7). 
On the other hand, much of the literature on causal 
inference in the tradition of Rosenbaum and Rubin 
(1983) eschews models for yi in favor of matching 
and stratification based on propensity scores (e.g., 
Rosenbaum, 2002). The latter is motivated in part 
by a perceived inability to model the responses well 
enough to mitigate the dangers of extrapolation. 
Other uses for propensity scores, including inverse- 
propensity weighting (Robins, Rotnitzky and Zhao, 
1994), also rely heavily on a 7r-model while relax- 
ing assumptions about yi. DR methods will require 
both a y-model and a vr-model but remain consis- 
tent if one or the other is wrong. As we shall see, 
however, this property does not necessarily trans- 
late into improved performance when both models 
fail. 

1.4 A Simulated Example 

Consider the following example which, although 
artificial, bears some resemblance to what we have 
encountered in a real study. For each unit i = 1, . . . ,n, 
suppose that {zn, Zi2, Zis, Zi^)"^ is independently dis- 
tributed as A^(0, /) where / is the 4x4 identity ma- 
trix. The yj's are generated as 

yi = 210 + 27Azii + 13.7zi2 + 13.7zi3 + 13.7^^4 + et, 

where ~ A^(0,1), and the true propensity scores 
are 

TTj = expit(-2;ji + 0.5zj2 - 0.25zi3 - O.lzi^). 

This mechanism produces an average response rate 
of = 0.5, and the means are = 210.0, //^^^ = 
200.0 and = 220.0. The selection bias in this 
example is not severe; the difference between the 
mean of the respondents and the mean of the full 
population is only one-quarter of a population stan- 
dard deviation. Nevertheless, this difference is large 
enough to wreak havoc on the performance of the 



naive estimate y^^^ when used as the basis for con- 
fidence intervals and tests. 

In this example, a logistic regression of ti on the 
Zjj's would be a correct vr-model, and a linear regres- 
sion of yi on the Zij 's would be a correct y-model. We 
will suppose, however, that instead of being given 
the Zij^s, the covariates actually seen by the data 
analyst are 

Xil = exp{zii/2), 

Xi2 = Ziil (1 + exp(2;ii)) + 10, 
a;i3 = (^ii2i3/25 + 0.6)3, 

Xi^ = {z2^Z^^2^f. 

This implies that logit(vrj) and rrij are linear func- 
tions of log(xii), X2, x\x2, l/log(xi), X3/log(j;i) 

1/2 

and X4 . Except by divine revelation, it is unlikely 
that an analyst who sees only Xi would ever for- 
mulate a correct vr- or y-model. Rather, he or she 
would naturally be drawn to models that are linear 
and logistic in the Xjj's, and those incorrect models 
look trustworthy. To illustrate, we drew a random 
sample of n = 200 units from this population, which 
happened to produce exactly = 100 respondents 
and m!^^^ = 100 nonrespondents. Scatterplots of yi 
versus Xij, j = 1, . . . ,4, for the 100 respondents are 
shown in Figure 2. Regressing yi on the Xjj's yields 
coefficients for xn, Xis and Xj4 that are highly sig- 
nificant and a coefficient for Xi2 that is nearly sig- 
nificant at the 0.05-level; the prediction is strong 
(-R^ = 0.81), and a plot of residuals versus fitted val- 
ues reveals no obvious outliers and little evidence 
of heteroscedasticity or nonlinearity. (In other sam- 
ples, some higher-order terms such as xf and X1X2, 
are significant and might be considered for inclusion 
in the model. Those terms, however, do little to im- 
prove the performance of any of the regression-based 
methods discussed below, and sometimes they are 
harmful.) 

The covariates seen by the analyst are also related 
to the tj's. Side-by-side boxplots of the Xjj's for the 
ti = and ti = 1 groups are shown in Figure 3(a)- 
(d). Fitting a logistic model to this sample, we find 
that the coefficients for xn and Xi2 are statistically 
significant, and all of the deviance residuals lie be- 
tween — 1.85 and +2.51. Figure 3(e) shows side-by- 
side boxplots of the linear predictors fji = logit(vri). 
As one would expect, the distributions of vTj in the 
two groups are different but not drastically so. To 
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check the appropriateness of the hnk function, Hink- 
ley (1985) suggested adding fjf as another covari- 
ate to see whether it is related to the response; the 
coefficient for this extra term was not significantly 
different from zero {p = 0.20), so in this sample an 
analyst would have little reason to alter the link. 

Comparing the fitted values of yi under the true 
and misspecified y-models, we find that the correla- 
tion between them is approximately 0.9. Similarly, 
the correlation between the r/j's under the true and 
misspecified vr-models is also about 0.9. This exam- 
ple appears to be precisely the type of situation for 
which the DR estimators of Robins et al. were de- 
veloped. By relying on two reasonably good models, 
one hopes that at least one is close enough to the 
truth to yield satisfactory results. Indeed, Bang and 
Robins (2005, Section 2.1) state: 

In our opinion, a DR estimator has the follow- 
ing advantage that argues for its routine use: if 
either the [y-model] or the [vr-model] is nearly 
correct, then the bias of a DR estimator of /i 
will be small. Thus, the DR estimator . . . gives 
the analyst two chances to get nearly correct 
inferences about the mean of Y . 

In Sections 2 and 3, we describe various techniques 
for estimating ^ based on the observed data and 
evaluate their performance in this simulated exam- 
ple. Some of these methods use a 7r-model, some 
use a y-model, and some rely on both. All of the 
dual-modeling strategies possess a DR property, but 
they do not perform equally well. Pooling informa- 
tion from two models can be helpful, but the manner 
in which the information is pooled makes a differ- 
ence. (Due to space limitations, we will not discuss 
computation of standard errors. Tractable variance 
estimates are available for most of these methods, 
but our purpose is to compare the performance of 
the estimates themselves.) In Section 4, we will pro- 
vide further justification for why we constructed our 
example as we did, and we will outline the cru- 
cial differences between this and simulated exam- 
ples used by Bang and Robins (2005) and others, 
so that apparently contradictory conclusions can be 
reconciled. 

2. WEIGHTING, STRATIFICATION 
AND REGRESSION 

2.1 Inverse-Propensity Weighting 

Weighting observed values by inverse-probabilities 
of selection was proposed by Horvitz and Thomp- 



son (1952) in the context of survey inference for fi- 
nite populations. The same idea is used in impor- 
tance sampling, a Monte Carlo technique for ap- 
proximating the moments of a distribution using 
random draws from another distribution that ap- 
proximates it (Hammersley and Handscomb, 1964; 
Geweke, 1989). It is easy to see that n~^J2i'ti'^7^yi: 
which can be computed from the respondents alone, 
unbiasedly estimates the mean of the entire popula- 
tion, because strong ignorability implies that 

E{ti7rr^yi) = E{E{tiTT~^yi \ Xi)) 

= E{-Ki'K~^mi) = ji. 

Precision is often enhanced if we use a denominator 
of '^itiTi~^ rather than n, so that the estimate be- 
comes a weighted average of the y^'s for the respon- 
dents. Normalizing the weights in this manner, and 
replacing the unknown propensities by estimates de- 
rived from a vr-model, the inverse-propensity weighted 
(IPW) estimate becomes 



(3) 



tJ-IPW-POP 



EitiTTi ^yi 



The ^^POP" in the subscript indicates that we are 
reweighting the respondents to resemble the full pop- 
ulation. Alternatively, we may reweight them to ap- 
proximate the population of nonrespondents (Hi- 
rano and Imbens, 2001). An estimate of i-i^^^ based 
on that idea is 

40) _ J2itiK^{l-TT)yi 

^^IPW-NR- ^ 4 ^-in ^\ ' 
and a corresponding estimate of /i is 

(4) fiiPW-NR = r^'^y^'^ + r^'^^fifi^w-NR- 

The IPW estimator can also be regarded as the 
solution to a simple weighted estimating equation 
WiUi = 0, where Wi = Utx'^ and Ui = {yi - i-L)/a'^ 
for any o"^ > 0. From this standpoint, the IPW method 
can be generalized to estimate a vector of population- 
average coefficients for the regression of yi on an ar- 
bitrary set of covariates. In the regression setting, 
Ui becomes a vector representing the ith unit's con- 
tribution to a quasi-score function. Coefficients esti- 
mated in this manner are consistent and asymptot- 
ically normally distributed provided that the vTj's 
are bounded away from zero and the vr-model has 
been correctly specified. Asymptotic properties and 
methods for variance estimation are described by 
Binder (1983) when the propensities are known, and 
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Fig. 2. Scatterplots of response versus observed covariates for respondents in a sample of 200 units. 



by Robins, Rotnitzky and Zhao (1994, 1995) when In the original method of Horvitz and Thomp- 
the propensities have been estimated. son (1952), the vTj's were determined by a known 
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survey design. Surveys are usually designed to en- 
sure that IPW estimates are acceptably precise, but 
the forces of nature that govern uncontrolled nonre- 
sponse are often unkind. In missing-data problems, 
IPW methods assign large weights to respondents 
who closely resemble nonrespondents, causing the 
estimates to have high variance. IPW estimates are 
also sensitive to misspecification of the vr-model, be- 
cause even mild lack of fit in outlying regions of the 
covariate space where vrj « translates into large 
errors in the weights. These shortcomings of IPW 
estimators have been known for many years; see, for 
example, the comments of Little and Rubin (1987, 
Section 4.4.3) on the IPW methods for nonresponse 
proposed by Cassel, Sarndal and Wretman (1983). 

To see how well IPW performs on the artificial 
population described in Section 1.4, we created 1000 
samples of n = 200 and n = 1000 units each and, for 
each sample, estimated the propensity scores in two 
ways. First, we fit a correctly specified vr-model, re- 
gressing the ti's on zn, Zi2, Zis and Zj4 using a logit 
link. Second, we fit the incorrect model which re- 
places the Zij^s with Xjj's. The behavior of the four 
IPW estimates (3)-(4) under the correctly and in- 
correctly specified vr-models is summarized in Ta- 
ble 1. In this table, "Bias" is the average difference 
between the estimate and fi = 210, and "% Bias" 
is the bias as a percentage of the estimate's stan- 
dard deviation. (A useful rule of thumb is that the 
performance of interval estimates and test statis- 
tics begins to deteriorate when the bias of the point 
estimate exceeds about 40% of its standard devia- 
tion.) "RMSE" is square root of the average value 
of (/t — /u)^. Examining Table 1, we see that the 
IPW estimates are biased when the vr-model is mis- 
specified, and the biases are accompanied by huge 



losses in precision. In fact, the bias and RMSE ac- 
tually get worse as the sample size grows! IPW es- 
timates have higher variance than other procedures 
examined in this article even when the propensities 
are correctly modeled, but when the vr-model is in- 
correct, the method breaks down. Interestingly, NR 
weighting performs better than POP weighting; we 
do not know whether the superiority of NR is a pe- 
culiar feature of this example or if it tends to hold 
more generally. 

The poor performance of IPW is due in part to 
occasional highly erratic estimates produced by a 
few enormous weights. In practice, a good data an- 
alyst would never use a simple IPW estimator if the 
weights were too extreme. Unusually large weights 
may be taken as a sign of model failure, prompt- 
ing the researcher to revise the vr-model. Outlying 
weights may be truncated or shrunk to more sensi- 
ble values, or the offending units with large weights 
may be removed. (Removing these units is not rec- 
ommended, because the respondents with the low- 
est propensities are in fact those that contain the 
best information for predicting nonrespondent be- 
havior.) Analysts who apply IPW to real problems 
quickly learn that it often cannot be used without 
ad hoc modifications. The column of Table 1 labeled 
"MAE" reports the median of the absolute errors 
\fi — which discards the worst 50% of the esti- 
mates. Even by this robust measure of precision, 
IPW performs more poorly than the other methods 
we examine when the vr-model is misspecified, and 
it fails to improve as the sample size grows. 

In many applications of IPW methodology, weights 
are obtained by logistic regression. Logistic models 
can be a poor way to estimate response propensities, 
because ML estimates from the logistic model are 



Table 1 

Performance of IPW estimators of ji over 1000 samples from the artificial 

population 



Sample size 


TT-model 


Method 


Bias 


% Bias 


RMSE 


MAE 


(a) n = 200 


Correct 


IPW-POP 


-0.27 


-7.0 


3.86 


2.43 






IPW-NR 


-0.29 


-8.2 


3.60 


2.36 




Incorrect 


IPW-POP 


1.58 


19.2 


8.35 


3.32 






IPW-NR 


0.61 


10.3 


5.99 


3.03 


(b) n = 1000 


Correct 


IPW-POP 


-0.01 


-0.5 


1.81 


1.16 






IPW-NR 


-0.03 


-1.5 


1.68 


1.09 




Incorrect 


IPW-POP 


5.05 


45.9 


12.10 


2.80 






IPW-NR 


3.22 


49.1 


7.29 


2.34 
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not resistant to outliers (Pregibon, 1982). A promis- 
ing alternative is the robit model, which replaces the 
logistic link by the cumulative distribution function 
of a Student's t-distribution with v degrees of free- 
dom (Albert and Chib, 1993). The logit link is well 
approximated by the robit with v = 1 ^ and smaller 
values of v lead to estimates that are more robust 
(Liu, 2004). In this example, robit models produce 
minor improvements when the covariates are cor- 
rect and major improvement when the covariates are 
wrong. We found that, with samples of n = 1000, re- 
placing the logit link by robit with i/ = 4 reduces the 
bias and RMSE by nearly 50% when the vr-model is 
incorrect. If IPW must be used, replacing the logis- 
tic regression with a more robust procedure can be 
advantageous. 

2.2 Propensity- Based Stratification 

To mitigate the dangers of extreme weights and 
misspecification of the vr-model, some prefer to coars- 
en the estimated propensity scores into a few cate- 
gories and compute weighted averages of the mean 
response across categories. In the context of survey 
nonresponse, the technique is known as weighting- 
cell estimation or adjustment (Oh and Scheuren, 
1983; Little, 1986; Little and Rubin, 2002). Strong 
ignorability implies that yi and are conditionally 
independent within any subpopulation for which 
■Ki{xi) is constant (Rosenbaum and Rubin, 1983). 
In classes of constant propensity, the mean values 
of yi for respondents and nonrespondents are equal, 
which implies that 



(5) 



J E{yi \Tri,ti = l)dP{TTi 



Suppose we fit a vr-model and define strata s = 1, . . . , 
S by grouping units whose vrj's are similar. Define 
Cis = 1 if unit i belongs to stratum s and other- 
wise. The vr-stratified estimate of fj, approximates 
(5) by a weighted average of respondents' mean in 
each stratum, weighted by the proportion of sample 
units in that stratum. 



(6) 



l-^strat- 



n 



Similarly, a vr-stratified estimate of fi^^^ weights the 
respondents' mean in each stratum by the propor- 
tion of nonrespondents in that stratum. 



r^strat-TT / y 
s=l 



which may be combined with ]f^^ as in (4) to pro- 
duce another estimate of \i. Rosenbaum and Rubin 
(1983) suggest classifying units into 5 = 5 strata de- 
fined by the sample quintiles of vTj, as this tends to 
eliminate more than 90% of the selection bias if the 
vr-model is correct (Cochran, 1968). 

The performance of the vr-stratified estimator (6) 
over the 1000 samples from the artificial population 
is summarized in Table 2. Comparing these results 
to those of Table 1, we see that stratification is less 
effective than IPW at removing bias when the vr- 
model is correct, but the stratified estimators still 
outperform IPW in terms of RMSE. The increase in 
bias incurred by coarsening the vrj's is easily offset 
by the greater efficiency that results from stabilizing 
the largest weights. When the vr-model is misspeci- 
fied, the stratified estimators are more biased than 
IPW for samples of n = 200 but less biased when 
n = 1000, because the bias of the stratified estima- 
tors does not worsen as n increases. 

2.3 Regression Estimation 

IPW and vr-stratified estimators pay no heed to 
relationships between the covariates and y^. Regres- 
sion estimators, on the other hand, model yi from 
Xi directly and use this information to predict the 
missing values. Because strong ignorability implies 
that 

E{yi I Xi.ti = 0) = E{yi \ Xi,ti = 1) = E{yi \ Xi), 

we can regress yi on Xi among the respondents, ap- 
ply the estimated regression function to predict yi 
for the entire sample, and then average the predicted 
values to obtain an estimate of fi. Let 



/3= E*, 



denote the ordinary least-squares (OLS) coefficients 
from the regression of yi on Xi among the respon- 
dents, and let rhi = xff3. The OLS regression esti- 
mate for /i is 

(7) AoLS = rE' 



mi. 



n 



This estimate is unbiased if the y-model is true, that 
is, if E{yi I Xi) = xJ f5 for some [3 G 7^^; in addition, 
it is highly efficient if af = V{yi\xi) is nearly con- 
stant. If the response is heteroscedastic, efficiency 
can be improved by replacing (3 with a weighted 
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least-squares estimate with weights proportional to 

From our 1000 samples, we computed OLS regres- 
sion estimates under a correct y-model (regressing 
Ui on the Zjj's) and an incorrect y-model (regressing 
Ui on the Xjj's). The results, which are summarized 
in Table 3, verify that the bias is indeed removed 
when the y-model is correct but not when the model 
is wrong. Comparing the RMSE values in this table 
with those in Tables 1 and 2, we see that estimates 
based on the incorrect y-model are more stable and 
efficient than those based on the incorrect vr-model. 
The bias that remains due to misspecification of the 
?/-model is not large in absolute terms, but it is still 
troubling in samples of n = 1000 because there it 
amounts to more than 50% of a standard error and 
begins to impair tests and intervals. Difficulties with 
parametric missing-data methods arise when the un- 
certainty due to the model specification, which is 
rarely accounted for, grows relative to the sampling 
variation under the assumed model. In those cases, a 
point estimate based on a misspecified but reason- 
able y-model may still perform better than other 
estimates, but the analyst is too optimistic about 
its precision. 

The performance of the regression estimate de- 
pends heavily on the strength of the correlation R 
between yi and m^. As B? approaches 0, jloLS con- 
verges to y^^^ and suffers from the same bias as this 
naive estimate. As E? approaches 1, it converges to 



the mean of the full sample, y = n J2iyi^ which is 
strongly robust. Therefore, if the y-model has strong 
prediction, the regression estimator tends to domi- 
nate other methods in terms of bias, efficiency and 
robustness. The IPW and vr-stratified estimators, on 
the other hand, break down as the predictive ability 
of the vr-model increases. 

2.4 Stratification by Propensity Scores and 
Predicted Values 

We saw that the efficiency and robustness of an 
IPW estimator can be enhanced by coarsening the 
TTi^s into a small number of categories. The efficiency 
of these estimators can be further increased by adding 
covariates to the vr-model that are predictive of yi, 
even if these covariates are unrelated to ti (Lunce- 
ford and Davidian, 2004). Vartivarian and Little (2002) 
show that further improvement is possible if we cross- 
classify units by estimated propensity scores and co- 
variates that are strongly related to the outcome. 
The idea of combining estimated propensity scores 
with additional covariate information is not new. 
For example, Rosenbaum and Rubin (1985) recom- 
mended matching respondents to nonrespondents or 
vice versa by a Mahalanobis-distance criterion based 
on Xi within calipers defined by the estimated propen- 
sity scores. Numerous authors have performed re- 
gression adjustments based on Xi within strata de- 
fined by vr; for references, see D'Agostino (1998). 

To illustrate a simple version of this idea, suppose 
that we create 25 strata by cross-classifying units 



Table 2 

Performance of propensity-stratified estimators over 1000 samples from the 

artificial population 



Sample size 


TT-model 


Method 


Bias 


% Bias 


RMSE 


MAE 


(a) n = 200 


Correct 


strat-TT 


-1.15 


-38.1 


3.22 


2.17 




Incorrect 


strat-Tv 


-2.82 


-87.7 


4.28 


3.13 


(b) n = 1000 


Correct 


strat-Tv 


-1.08 


-81.5 


1.71 


1.18 




Incorrect 


strat-TT 


-2.87 


-202.7 


3.19 


2.83 



Table 3 

Performance of ordinary least-squares regression estimators over 1000 samples from 

the artificial population 



Sample size 


y-model 


Method 


Bias 


% Bias 


RMSE 


MAE 


(a) n = 200 


Correct 


OLS 


-0.08 


-3.4 


2.48 


1.68 




Incorrect 


OLS 


-0.57 


-17.7 


3.26 


2.24 


(b) n = 1000 


Correct 


OLS 


-0.00 


-0.1 


1.17 


0.79 




Incorrect 


OLS 


-0.84 


-56.0 


1.72 


1.15 
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into cells defined by the sample quintiles of vTj and 
rhi, and then estimate /i by a weighted average of 
the mean response across the cells. If we apply this 
procedure over repeated samples, we will occasion- 
ally encounter a cell that contains nonrespondents 
but no respondents, in which case the stratified es- 
timates are undefined. For those samples, we must 
modify the estimate in some fashion. For example, 
we may collapse adjacent rows or columns in the 
5x5 table, fuse the offending cell with an adjacent 
cell, impute the missing cell mean by a regression 
estimate that assumes the response surface has row 
and column effects but no interactions, and so on. 

In general, we dislike procedures that require fre- 
quent ad hoc adjustments unless their operating char- 
acteristics are well understood and the variance of 
the adaptive estimator can be approximated. Never- 
theless, it is interesting to see how well the method 
performs in our artificial example. For each of our 
samples, we computed (tt x m)-stratified estimators 
using all four possible combinations of correct and 
incorrect tt- and y-models. Missing cell means were 
imputed by a regression procedure that assumes no 
row-by-column interactions. The results, which are 
summarized in Table 4, show that dual stratification 
produces a crude kind of double robustness; the bias 
is relatively low if the vr-model is correctly specified 
or if the ?/-model is correctly specified. Under strong 
ignorability, a stratified estimate of the population 
mean will be unbiased if either the true vrj's or the 
true mj's are constant within strata. 

It is also useful to compare the results in Ta- 
ble 4 where both models are incorrect with those of 
Table 2 where the vr-model is incorrect. This com- 
parison shows that a vr-stratified estimator can be 
improved with predicted values for the y^'s, even 
if those predictions come from a coarsened, mis- 
specified y-model. The key idea of DR estimation — 
reducing your reliance on one model by specifying 
two — does produce modest gains in this example 
over estimates based on a vr-model alone. Comparing 
Tables 3 and 4, however, we find that the approx- 
imate DR procedure based on two incorrect mod- 
els performs worse than OLS based on the incorrect 
y-model. When neither model is exactly true, two 
models are not necessarily better than one. 



3. CONSTRUCTING DOUBLY ROBUST 
ESTIMATES 

3.1 Regression Estimation with Residual 
Bias Correction 

Cassel, Sarndal and Wretman (1976, 1977) intro- 
duced a family of "generalized regression estima- 
tors" for population means and totals that combine 
model-based predictions for yi with inverse-probabil- 
ity weights. These methods, which are part of a 
methodology called model-assisted survey estima- 
tion (Sarndal, Swensson and Wretman, 1992), are 
highly efficient when the y-model is true, yet remain 
asymptotically unbiased when the y-model is mis- 
specified. In the original formulation, the response 
probabilities were a known feature of the survey de- 
sign, but with uncontrolled nonresponse the propen- 
sities may be estimated under a vr-model (Cassel, 
Sarndal and Wretman, 1983). 

To understand how these generalized regression 
estimators work, consider the simple regression esti- 
mator (7). If the regression model holds in the sense 
that E{yi\xi) = xj (3 for some (3 G TiP , then on av- 
erage the predictions rhi = j3 will be neither too 
high nor too low; the mean of the estimated resid- 
uals ei = yi — rhi in the population will be zero. Of 
course, residuals are seen only for sampled respon- 
dents. We can, however, consistently estimate the 
mean residual for the full population if we have ac- 
cess to a vr-model, and this estimate can in turn be 
used to correct the OLS estimate for bias arising 
from y-model failure. Cassel, Sarndal and Wretman 
(1976) proposed the bias-corrected estimate 

(8) P^BC-OLS = P'OLS + ~ tiTT~^£i. 

i 

Notice that if the y-model is true, then E{ei) = 0, 
and the second term on the right-hand side of (8) has 
expectation zero for arbitrary vTj's. If the vr-model is 
true, then the second term consistently estimates 
(minus one times) the bias of the first term. There- 
fore, this estimate is doubly robust. 

Many variations on this approach are possible. For 
example, we can normalize the weights in the cor- 
rection term, so that the estimate becomes 

, HiUT^'^ei 

Eitiir ^ 

Or we can replace the POP weights with NR weights, 
so that the correction term estimates the mean resid- 
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Table 4 

Performance of propensity and fitted-value stratified estimators over 1000 samples 
from the artificial population 



Sample size 


TT-model 


y-model 


Method 


Bias 


% Bias 


RMSE 


MAE 


(a) n = 200 


Correct 


Correct 


strat-Tvm 


-0.34 


-11.6 


2.92 


1.90 






Incorrect 


strat-Tvm 


-0.59 


-18.4 


3.25 


2.19 




Incorrect 


Correct 


strat-Tvm 


-0.49 


-17.1 


2.89 


1.96 






Incorrect 


strat-nm 


-2.00 


-61.4 


3.82 


2.62 


(b) n = 1000 


Correct 


Correct 


strat-Tim 


-0.27 


-21.1 


1.31 


0.87 






Incorrect 


strat-nm 


-0.45 


-33.7 


1.42 


0.92 




Incorrect 


Correct 


strat-Tvm 


-0.51 


-39.6 


1.38 


0.92 






Incorrect 


strat-Tvm 


-2.10 


-148.7 


2.53 


2.11 



ual in the population of nonrespondents. A bias- 
corrected estimate of /i'^'^^ based on this idea is 

which can be combined with y^^^ as in (4) to produce 
another DR estimate for fl. A third possibihty is 
to replace the weighted correction term with a tt- 
stratified estimate of E{ei). Using five strata based 
on sample quintiles of tTj would remove over 90% of 
the bias from the OLS estimate if the 7r-model were 
true and reduce problems of instability caused by 
very large weights. 

A more general version of (8) was independently 
proposed by Robins, Rotnitzky and Zhao (1994) for 
estimating population-average regression coefficients 
from incomplete data. Suppose Ui is the contribu- 
tion of sample unit i to a vector-valued quasi-score 
function for the regression of Ui on an arbitrary 
set of covariates. As noted in Section 2.1, the so- 
lution to J2i'>JJiUi = with Wi = ti%~^ provides a 
consistent and asymptotically normal estimate of 
the population-average regression coefficients if the 



model used to estimate the vrj's is correct. Now sup- 
pose that we change the estimating equations to 
Y.i[wiUi + (1 - Wi)(j)i] = 0, where ct)i = 4>{xi) is an ar- 
bitrary term that may depend on but not on yi. 
The mean of the additional term {1 — Wi)(j)i is essen- 
tially zero if the vr-model is true, because E{wi) = 
E{E{wi I Xi)) = E{Tt~^7ri) ^ 1. Therefore, the solu- 
tion to these augmented inverse-probability weighted 
(AIPW) estimating equations is again consistent and 
asymptotically normally distributed under a correct 
TT-model. Robins and Rotnitzky (1995) demonstrate 
that a judicious choice for (pi can greatly improve 
upon the efficiency of the simple IPW estimator. 
In particular, choosing (pi = E{Ui \ Xi), where the 
expectation is taken with respect to the distribu- 
tion for yi given Xi, produces a locally semiparamet- 
ric efficient estimator, the most efficient estimator 
within this class. This estimate is DR, maintaining 
its consistency if either the vr-model or y-model is 
correct (Scharfstein, Rotnitzky and Robins, 1999). 
Taking Ui = {yi — /i)/(T^, and estimating E{yi \ Xi) 
by (rfij — fl) /cj^ where rhi = xj/], the solution to the 



Table 5 

Performance of bias- corrected regression estimators over 1000 samples from the 

artificial population 



Sample size 


TT-model 


i/-model 


Method 


Bias 


% Bias 


RMSE 


MAE 


(a) n = 200 


Correct 


Correct 


BC-OLS 


-0.08 


-3.4 


2.48 


1.68 






Incorrect 


BC-OLS 


0.25 


7.5 


3.28 


2.17 




Incorrect 


Correct 


BC-OLS 


-0.08 


-3.3 


2.48 


1.70 






Incorrect 


BC-OLS 


-5.12 


-43.0 


12.96 


3.54 


(b) n = 1000 


Correct 


Correct 


BC-OLS 


0.00 


-0.1 


1.17 


0.79 






Incorrect 


BC-OLS 


0.06 


3.4 


1.75 


1.02 




Incorrect 


Correct 


BC-OLS 


-0.02 


-1.4 


1.49 


0.80 






Incorrect 


BC-OLS 


-21.03 


-13.5 


157.21 


5.32 



12 



J. D. Y. KANG AND J. L. SCHAFER 



AIPW estimating equation reduces to the general- 
ized regression estimator (8). More generally, it be- 
comes the solution to 

(9) -VC/. + -Vti7rri(C/,-;7,) = 0, 
n ^ n ^ 

where Ui is the quasi-score function Ui with yi re- 
placed by rhi. 

By analogy to (8), (9) can be viewed as a predicted 
estimating equation with a residual bias correction. 
Any of the variations on (8) described above — normal- 
izing the weights, switching to NR weights, or switch- 
ing from an IPW bias correction to a vr-stratified 
one — can be applied to (9) as well. Modifications 
like these would take the estimator outside of the 
AIPW class. Nevertheless, these changes could po- 
tentially improve performance when either or both 
models are misspecified. 

The performance of the bias-corrected regression 
estimate (8) in our simulated example is summa- 
rized in Table 5. The bias of this estimate does in- 
deed vanish when either of the two models is correct. 
Moreover, comparing these results to those from the 
IPW-POP estimates in Table 1, we see that aug- 
menting the IPW procedure by information from a 
correct y-model does indeed increase the efficiency. 
In the more realistic condition where both models 
are misspecified, however, this DR estimate does 
worse than IPW. A similar pattern emerges if we 
compare the new results to those from the simple 
OLS regression estimate in Table 3. Bias from an in- 
correct y-model is repaired by a vr-model if the latter 
is correct. When both models are misspecified, how- 
ever, the DR procedure is substantially worse than 
OLS. Like IPW, estimates from this DR procedure 
often behave erratically because one or more weights 
are occasionally enormous. Even if we trim away 
these "bad" samples and judge the performance by 
the MAE, however, the new procedure is still worse 
than IPW and OLS. Once again, two models are not 
necessarily better than one. 

Why does this DR estimate fail to perform bet- 
ter than IPW and OLS even though the vr- and y- 
models are reasonably close to being true? The lo- 
cal semiparametric efficiency property, which guar- 
antees that the solution to (9) is the best estimator 
within its class, was derived under the assumption 
that both models are correct. This estimate is in- 
deed highly efficient when the vr-model is true and 
the y-model is highly predictive. In our experience. 



however, if the vr-model is misspecified, it is not dif- 
ficult to find DR estimators that outperform this 
one by venturing outside the AIPW class. For this 
particular example, normalizing the POP weights, 
switching to NR weights, and using a vr-stratified 
bias correction all improve upon fiBC-OLS- There are 
yet more ways to construct DR estimates, as we now 
describe. 

3.2 Regression Estimation with 

Inverse-Propensity Weighted Coefficients 

The correction term in fj-BC-OLS repairs the bias in 
fj-OLS = J2i ^fP by estimating the mean residual 
in the full population. A different way to repair this 
bias is to move the estimated coefficients away from 
p. Imagine that we could see the OLS coefficients 
based on the full sample, 

PsAMP = (j2 Xi^I^ ^iVi^ ■ 

A well-known property of OLS regression is that 
the sum of the estimated residuals yi — xJPsamp, 
i = 1, . . . ,n, is zero if Xi includes a constant. This 
is an algebraic identity that holds regardless of the 
actual form of E{yi | Xi). This identity implies that 
the regression estimator based on (3sAMP replicates 
the mean of yi in the full sample, 

- ^IPsamp = -Yyi = y^ 

n ^ n ^ 

which is a strongly robust estimate of /i. We can- 
not compute PsAMP from the observed data. But 
with propensities estimated from a vr-model, we can 
compute a weighted least-squares (WLS) estimate 

PwLS = (y ti7t:~^Xixf^ tiTr:r^Xiyi^ . 

In a well-behaved asymptotic sequence, Psamp and 
PwLS both converge to the coefficients from the lin- 
ear regression of yi on Xi in the full population, re- 
gardless of whether that regression is an accurate 
portrayal of E{yi \ Xi). If we compute a regression 
estimate for /i based on the WLS coefficients, 

(10) fl.WLS = -Y^'i l^WLS, 

I 

the difference between this estimate and y converges 
in probability to zero as n — > cx3, provided that the 
vr-model holds. 
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From this discussion, we see that fiwLS consis- 
tently estimates /i if the vr-model is true. If the y- 
model is true, then PwLS will be an inefficient but 
consistent estimate of f3, and ft wis will again be 
consistent; thus it is DR. 

In our simulated example, the WLS regression es- 
timate is sometimes inferior to the bias-corrected 
OLS estimate (8) when one of the models is true, 
but much better when both models are misspecified 
(Table 6). Comparing jlwLS to fioLS-, we see that 
the inverse-propensity weighted estimate of [3 effec- 
tively corrects the bias from a wrong y-model if the 
vr-model is correct, but makes matters worse if the 
TT-model is wrong. 

Once again, many variations on (10) are possi- 
ble. Normalizing the POP weights tiTi^^ has no ef- 
fect on 13 WLS ^ but one might consider switching to 
NR weights. Applying NR weights and averaging the 
predicted values of xj f5 among nonrespondents pro- 
duces an estimate of which can then be com- 
bined with y^^) to produce another estimate of ^. 
Another possibility is to coarsen the weights into, 
say, five categories and compute a yr-stratified esti- 
mate of /?. 

3.3 Regression Estimation with 
Propensity-Based Covariates 

A third general strategy for constructing a DR 
estimate is to incorporate functions of estimated 
propensities into the y-model as covariates. Ordi- 
nary regression estimates are based on the relation- 
ship 



E{yi I Xi)dP{xi) 



and achieve consistency if E{yi \ Xi) = E{yi \ Xi, 
ti = 1) is consistently estimated for all Xj. In prac- 



O J 



t (V 




linear predictor 

Fig. 4. Scatterplot of raw residuals from y -model against 
linear predictors from a Tv-model, with least-squares and local 
polynomial fit. 



tice, creating a model that gives unbiased predic- 
tions for yi over the whole covariate space can be a 
daunting task, because real data often exhibit non- 
linearities, interactions, etc. that are difficult to iden- 
tify and portray, especially as the dimension of Xi 
grows. From (5), however, we see that requiring un- 
biased prediction for all Xi is much stronger than 
necessary; it would suffice to have unbiased pre- 
diction of E{yi I 7r{xi)) = E{yi \ 7r(xi),tj = 1) for all 
7r(xj) € (0, 1). If we want to repair the bias in a para- 
metric y-model, it makes sense to first identify and 
correct for lack of fit in the direction of the propen- 
sity score, because vTj = 7r(xi) is the coarsest sum- 
mary of Xi that makes yi and ti conditionally inde- 
pendent (Rosenbaum and Rubin, 1983). 

Consider the sample of n = 200 units from our 
artificial population that we examined in Section 
1.4. Figure 4 shows the residuals ii from the lin- 
ear regression of yi on Xj, plotted against the lin- 
ear predictors r/,, from the logistic regression of ti 



Table 6 

Performance of regression estimators with inverse-propensity weighted coefficients 
over 1000 samples from the artificial population 



Sample size 


TT-model 


y-model 


Method 


Bias 


% Bias 


RMSE 


MAE 


(a) n = 200 


Correct 


Correct 


WLS 


-0.09 


-3.4 


2.48 


1.68 






Incorrect 


WLS 


0.38 


13.2 


2.88 


1.92 




Incorrect 


Correct 


WLS 


-0.08 


-3.4 


2.48 


1.68 






Incorrect 


WLS 


-2.20 


-70.0 


3.83 


2.74 


(b) n = 1000 


Correct 


Correct 


WLS 


0.00 


-0.1 


1.17 


0.78 






Incorrect 


WLS 


0.16 


12.0 


1.35 


0.92 




Incorrect 


Correct 


WLS 


0.00 


-0.1 


1.17 


0.78 






Incorrect 


WLS 


-2.99 


-203.6 


3.33 


2.98 
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Table 7 

Performance of propensity- covariate (four dummy indicators) regression estimators 
over 1000 samples from the artificial population 



Sample size 


TT-model 


y-model 


Method 


Bias 


% Bias 


RMSE 


MAE 


(a) n = 200 


Correct 


Correct 


TT-COV 


-0.09 


-3.4 


2.48 


1.69 






Incorrect 


TV-COV 


-0.39 


-13.5 


2.93 


2.00 




Incorrect 


Correct 


TV-COV 


-0.09 


-3.4 


2.48 


1.68 






Incorrect 


TT-COV 


-1.27 


-38.6 


3.51 


2.43 


(b) n = 1000 


Correct 


Correct 


TT-COV 


0.00 


-0.1 


1.17 


0.79 






Incorrect 


TT-COV 


-0.55 


-42.1 


1.41 


0.87 




Incorrect 


Correct 


TT-COV 


0.00 


-0.2 


1.17 


0.79 






Incorrect 


TT-COV 


-1.49 


-100.6 


2.10 


1.56 



on Xi, for the n*-^-* = 100 responding units. A least- 
squares line fit to this plot has an intercept and 
slope of zero, because the predictor is a perfect linear 
combination of covariates already in the y-model. A 
smooth curve created by a local polynomial (loess) 
fit, however, suggests that predictions from the y- 
model tend to be slightly low in the middle of the 
propensity scale and slightly high at the extremes. 
The bias could be corrected by fitting a general- 
ized additive model (Hastie and Tibshirani, 1990) 
that allows the mean of yi to vary smoothly with 
TTj in a nonparametric fashion. Little and An (2004) 
incorporated a smoothing spline hased on fji and 
demonstrated that the resulting regression estimate 
of n is DR in the following sense: If the y-model 
correctly describes E{yi \ Xi) before the propensity- 
related terms are added, these additional terms (or 
any other functions of Xj) merely cause the model 
to be overfitted and add mean-zero noise to the pre- 
dicted values. If the vr-model is correct, then (5) 
guarantees consistent estimation of as long as the 
mean of yi varies smoothly with vTj and this rela- 
tionship can be arbitrarily well approximated by the 



Table 8 

Performance of inverse-propensity covariate regression estimators over 1000 samples from the 

artificial population 



Sample size 


TT-model 


j/-model 


Method 


Bias 


% Bias 


RMSE 


MAE 


(a) n = 200 


Correct 


Correct 


I/ti-cov 


-0.09 


-3.7 


2.48 


1.69 






Incorrect 


I/tt-cou 


1.66 


37.6 


4.70 


2.84 




Incorrect 


Correct 


I/tt-cou 


56.5 


3.1 


1804 


1.76 






Incorrect 


I/ti-cov 


-4236 


-3.4 


1.3 X 10^ 


7.79 


(b) n = 1000 


Correct 


Correct 


1/n-cov 


0.00 


-0.1 


1.17 


0.78 






Incorrect 


l/n-cov 


0.59 


31.3 


1.97 


1.26 




Incorrect 


Correct 


1/tt-cov 


-50.4 


-3.2 


1593 


0.83 






Incorrect 


1/tt-cov 


-7527 


-3.3 


2.3 X 10^ 


5.75 



linear combination of basis functions added to the 
model. In the latter case, the propensity-related co- 
variates completely remove the bias for estimating 
/i, and any additional information provided by Xj 
merely serves to make the estimate more precise. 

In the spirit of Little and An (2004), let Si = S{fji) 
denote a vector of basis functions (e.g., a spline ba- 
sis) that can serve to approximate the relationship 
between the mean of yi and fji, the estimated lin- 
ear predictor from the vr-model. Let x* = {xf , Sjy 
denote the augmented vector of covariates, and let 

(11) ^P*={^tixlxf^ (^Ux*y^ 

denote the OLS-estimated coefficients from the aug- 
mented y-model. If Si is a spline basis of degree 
A; > 1, the matrix inverse in (11) will not exist, be- 
cause Si will contain a constant and a linear function 
of r/j, which are themselves linear functions of Xj. 
Problems of collinearity can be alleviated by switch- 
ing to a generalized inverse or by removing the of- 
fending terms from Si] either approach leads to the 
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same predicted values m* = x*'^ (3*. The propensity- 
covariate regression estimate for ^ is then 

(12) fiw-cov = -y2'^i- 

A particular case of (12) was proposed by Scharf- 
stein, Rotnitzky and Robins (1999) who took Si = 
TTj"^. Using the inverse-propensity as a single addi- 
tional covariate is sufficient to achieve double ro- 
bustness, because the estimate then becomes the so- 
lution to an AIPW estimating equation (Bang and 
Robins, 2005). 

We tried (12) in our simulated example with a va- 
riety of spline bases: a quadratic spline with a single 
knot at the median of , a linear spline with knots at 
the quartiles, and so on. We found that these polyno- 
mial splines occasionally produced erratic predicted 
values of yi for nonrespondents with low propensi- 
ties, driving up the variance of the regression esti- 
mate. The best performance was achieved by sim- 
ply coarsening the f]i^ into five categories and cre- 
ating four dummy indicators to distinguish among 
them. In other words, we approximated the relation- 
ship between the mean of yi and TTj by a piecewise- 
constant function with discontinuities at the sample 
quintiles of TTj. The performance of this estimate is 
summarized in Table 7. It performs better than any 
of the other DR methods when the vr-model and y- 
model are both incorrect. It performs better than 
any method based on a vr-model alone. Yet it is still 
inferior to the simple OLS regression estimate under 
the incorrect y-model. 

In contrast, the regression estimate of Scharfstein, 
Rotnitzky and Robins (1999) that uses the inverse- 
propensity as a covariate behaves poorly under a 
misspecified vr-model (Table 8). The performance 
of this method is disastrous when some of the es- 
timated propensities are small. 

4. DISCUSSION 

Double robustness is an interesting theoretical 
property that can arise in many ways. It does not 
necessarily translate into good performance when 
neither model is correctly specified. Some DR esti- 
mators have been known to survey statisticians since 
the late 1970s. In survey contexts, these methods are 
not thought of as doubly robust but simply as ro- 
bust, because the propensities are a known feature of 
the sample design. When propensities are unknown 



and must be estimated, care should be taken to se- 
lect an estimator that is not overly sensitive to mis- 
specification of the propensity model. 

No single example can effectively represent all 
missing-data problems that researchers will see in 
practice. We constructed our simulation to vaguely 
resemble a quasi-experiment to measure the effect of 
dieting on body mass index (BMI) in a large sample 
of high-school students. The study has a pre-post 
design. Covariates Xi measured at baseline include 
demographic variables, BMI, self-perceived weight 
and physical fitness, social acceptance and person- 
ality measures. The treatment tj is dieting (0 = yes, 
1 = no) and the outcome yi is BMI one year later. 
The goal is to estimate an average causal effect of 
dieting among those who actually dieted. For that 
purpose, it suffices to treat the dieters as nonrespon- 
dents, set their BMI values to missing, and apply 
a missing-data method to estimate what the mean 
BMI for this group would have been had they not 
dieted. A simple linear regression of yi on Xi among 
the nondieters yielded an i?^ of 0.81, just as in our 
simulated data, and boxplots of the linear predic- 
tors from a logistic propensity model looked similar 
to those of Figure 3(e). The large degree of over- 
lap in the distributions of estimated propensities for 
the two groups makes a causal analysis seem feasi- 
ble. The estimated propensities for some nondieters 
are very small. Keeping these nondieters in the anal- 
ysis is highly desirable, because their covariate val- 
ues closely resemble those of dieters; they provide 
excellent proxy information for predicting the miss- 
ing values. Yet we found no obvious way to use these 
cases in an inverse-propensity weighted estimate, be- 
cause they exerted too much influence. 

Our simulation represents a situation where selec- 
tion bias is moderate, good predictors of yi are avail- 
able, both models are approximately but not exactly 
true, and some estimated propensities are nearly 
zero. In situations like these, a DR procedure that 
does not rely on inverse-propensity weighting may 
perform reasonably well, but there is no guarantee 
that it will outperform a procedure based solely on 
a y-model. A model that predicts yi reasonably well 
can enhance the performance of an approximate vr- 
model. But in our simulations, we found no way to 
use the fitted propensities from the approximate vr- 
model to reduce the bias from the approximate y- 
model. In every case, the bias correction applied to 
A OLS tended to move the estimate in the wrong di- 
rection. 
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Some might argue that inference about E{yi) in 
the full population should not be attempted in a 
situation like the one shown in Figure 3(e), where 
the estimated propensities for a few nonrespondents 
fall outside the range of those seen among the re- 
spondents. In our opinion, requiring the empirical 
support of P(7rj | tj = 1) to completely cover that 
of P(-7rj I = 0) is too stringent, especially given the 
sensitivity of these ranges to minor changes in the vr- 
model. In all 1000 samples of n = 200 and n = 1000, 
the distributions of the estimated propensities over- 
lapped sufficiently to compute a stratified estimate 
with five quintile-based groups. Moreover, although 
some of the estimated propensities were very small, 
none of the true propensities were actually zero, so 
the conditions required for DR estimation were not 
violated. Small propensities frequently occur when 
missing data are not missing by design, and data 
analysts need guidance on how to deal with them. 
One who would argue for the routine use of any kind 
of inverse-propensity weighted estimator ignores the 
obvious fact that these estimators cannot be used 
routinely. 

Other evaluations of DR methods under dual mis- 
specification have yielded mixed results, because the 
nature of the problems and degree of misspecifica- 
tion have varied. Davidian, Tsiatis and Leon (2005) 
presented a DR procedure analogous to fiBC-OLS for 
a pre~post analysis of a randomized clinical trial 
with dropout. Schafer and Kang (2005) evaluated 
their procedure and found that it performed slightly 
better than a parametric method based on multiple 
imputation (Rubin, 1987). In that analysis, each of 
the two treatment groups had its own y-model with 
R^'s of about 0.5. Each group also had its own vr- 
model, and the smallest fitted propensities were 0.14 
and 0.07. It thus appears that the use of AIPW es- 
timating equations can produce modest gains over a 
parametric method when the predictive power of the 
y-model is not too strong and the estimated propen- 
sities do not get close to zero. 

The only other simulations that we know of that 
compare the performance of DR and non-DR meth- 
ods under dual misspecification are those of Bang 
and Robins (2005). Their first example pertains to 
the estimation of a population mean /i. They com- 
pare the performance of three methods — the unnor- 
malized IPW estimate n~^J2i'ti'^i^^yi^ the ordinary 
regression estimate J2i $j the propensity- 
covariate estimate (12) that augments the y-model 
with l/vTi — when the vr- and y-models are correct 



and incorrect. In that example, the predictive abil- 
ity of the correct y-model among the respondents is 
very strong (i?^ = 0.94), but the incorrect version is 
worthless {R^ = 0.001); thus it maximally punishes 
the simple regression method when the y-model is 
misspecified. This approximates a situation where 
an analyst wants to impute missing values but has 
no idea how to use the covariates, so he simply ig- 
nores them and replaces all the missing values with 
yl^^ ■ It may also represent a situation where all of 
the confounders are hidden from the analyst for pur- 
poses of y-modeling (though not, strangely, for pur- 
poses of vr-modeling) . Another noteworthy feature 
of that example is that all of the covariates in the 
propensity model have been dichotomized and the 
selection mechanism is weak; when n is large, all of 
the estimated propensities fall nicely between 0.25 
and 0.5. Bang and Robins (2005) present three addi- 
tional examples to demonstrate the performance of 
DR estimates in more elaborate problems. In each 
case, all of the predictors in their vr-models were 
dichotomized, which helps to keep the estimated 
propensities away from zero. Despite these features, 
none of their examples supports the claim that a DR 
method based on two incorrect models is clearly and 
simultaneously better than an IPW procedure based 
on an incorrect vr-model and a simple imputation 
procedure based on an incorrect y-model. Only in 
the fourth example did the DR method outperform 
both of its competitors under dual misspecification, 
and even there the advantage was slight. Thus the 
results of Bang and Robins (2005) support our con- 
tention that two wrong models are not necessarily 
better than one. 
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